ASR corpus design for resource-scarce languages

نویسندگان

  • Etienne Barnard
  • Marelie H. Davel
  • Charl Johannes van Heerden
چکیده

We investigate the number of speakers and the amount of data that is required for the development of useable speakerindependent speech-recognition systems in resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for the purposes of acceptable phone-based recognition.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Efficient Harvesting of Internet Audio for Resource-Scarce ASR

Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages ...

متن کامل

Efficient data selection for ASR

Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource intensive activity and requires language resources in the form of text annotated audio recordings and pronunciation dictionaries. Unfortunately, many language...

متن کامل

Design of Cross-lingual and Multilingual Corpora for Speaker Recognition Research and Evaluation in Indian Languages

Automatic Speaker Recognition (ASR) is an economic method of biometrics because of the availability of the low cost and powerful processors. Results of ASR are highly dependent on database, i.e., the results obtained in an ASR system are meaningless if the recording conditions are not of standard. In this paper, a methodology and a typical experimental setup used for development of corpora for ...

متن کامل

CorpusCollie - A Web Corpus Mining Tool for Resource-Scarce Languages

This paper describes CORPUSCOLLIE, an open-source software package that is geared towards the collection of clean web corpora of resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of the most powerful components in CORPUSCOLLIE is a maximum-entropy based language identification module that is ab...

متن کامل

Automatic Construction of the Finnish Parliament Speech Corpus

Automatic speech recognition (ASR) systems require large amounts of transcribed speech data, for training state-of-theart deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems are prone to underperform in domains where there is not a lot of training data available. In this work, we open up a vast and previously unused resource of trans...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009